AIbase, 2025-04-05 09:39:19
DeepSeek and Tsinghua University Joint Research: Innovative Reward Model Inference Method Improves Scalability
Researchers from DeepSeek and Tsinghua University recently published a new paper exploring how to scale reward model inference, seen by some as groundwork for DeepSeek R2. Reinforcement learning is widely used in the large-scale post-training phase of large language models, but it depends on obtaining accurate reward signals, which remains a challenge. The researchers found that pointwise generative reward modeling (GRM) improves a reward model's adaptability and its ability to scale at inference time. Building on this, they propose a learning method called Self-Principled Critique Tuning (SPCT).
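The article does not give implementation details, but the inference-time scaling idea behind pointwise GRM can be sketched roughly as follows: sample the reward model several times per query, collect a pointwise score for each candidate response in every sample, then aggregate the scores by voting (summation). The function below is a hypothetical illustration of that aggregation step, not code from the paper.

```python
def vote_aggregate(sample_scores):
    """Aggregate k sampled pointwise reward scores by voting (summation).

    sample_scores: list of k samples, each a list of pointwise scores
    (one score per candidate response). Returns the per-response score
    totals and the index of the highest-scoring response.
    """
    n = len(sample_scores[0])
    totals = [0] * n
    for sample in sample_scores:
        for i, score in enumerate(sample):
            totals[i] += score
    best = max(range(n), key=lambda i: totals[i])
    return totals, best


# Example: 3 samples scoring 2 candidate responses on a 1-10 scale.
totals, best = vote_aggregate([[7, 3], [6, 5], [8, 4]])
print(totals, best)  # [21, 12] 0 -> response 0 wins the vote
```

Drawing more samples (larger k) spends more inference compute to stabilize the aggregated reward signal, which is the sense in which the method "scales" at inference time.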